Abstract: Nodes in real-world networks organize into 
densely linked communities where edges appear with high con- 
centration among the members of the community. Identifying 
such communities of nodes has proven to be a challenging 
task mainly due to a plethora of definitions of a community, 
intractability of algorithms, issues with evaluation and the lack 
of a reliable gold-standard ground-truth. 

In this paper we study a set of 230 large real-world social, 
collaboration and information networks where nodes explicitly 
state their group memberships. For example, in social networks 
nodes explicitly join various interest based social groups. We 
use such groups to define a reliable and robust notion of 
ground-truth communities. 

We then propose a methodology which allows us to compare 
and quantitatively evaluate how different structural definitions 
of network communities correspond to ground-truth communi- 
ties. We choose 13 commonly used structural definitions of net- 
work communities and examine their sensitivity, robustness and 
performance in identifying the ground-truth. We show that the 
13 structural definitions are heavily correlated and naturally 
group into four classes. We find that two of these definitions, 
Conductance and Triad-participation-ratio, consistently give 
the best performance in identifying ground-truth communities. 
We also investigate the task of detecting communities given 
a single seed node. We extend the local spectral clustering 
algorithm into a heuristic parameter-free community detection 
method that easily scales to networks with more than a hundred 
million nodes. The proposed method achieves 30% relative 
improvement over current local clustering methods. 

I. Introduction 

Networks are a natural way to represent social [20], biological [23], 
technological [16], and information [8] systems. 
Nodes in such networks organize into densely linked groups 
that are commonly referred to as network communities, clus- 
ters or modules [11]. There are many reasons why nodes in 
networks organize into densely linked clusters. For example, 
society is organized into social groups, families, villages and 
associations Q, [12]. On the World Wide Web, topically 
related pages link more densely among themselves [8]. And, 
in metabolic networks, densely linked clusters of nodes are 
related to functional units, such as pathways and cycles [23]. 

This paper has been published in the Proceedings of the 2012 IEEE International Conference on Data Mining (ICDM), 2012. 

To extract communities from a given undirected network, 
one typically chooses a scoring function (e.g., modularity) 
that quantifies the intuition that communities correspond to 
densely linked sets of nodes. Then one applies a procedure to 
find sets of nodes with a high value of the scoring function. 
Identifying such communities in networks [14], [6], [26], [9] 
has proven to be a challenging task [10], [18], ifTTl for 
three reasons: (1) there exist multiple structural definitions of 
network communities [5], [24]; (2) even if we would agree on 
a single common structural definition (i.e., a single scoring 
function), the formalizations of community detection lead 
to NP-hard problems [26]; and (3) the lack of reliable ground-truth 
makes evaluation extremely difficult. 

Currently the performance of community detection meth- 
ods is evaluated by manual inspection. For each detected 
community an effort is made to interpret it as a "real" 
community by identifying a common property or external 
attribute shared by all the members of the community. For 
example, when examining communities in a scientific collaboration 
network, we might by manual inspection discover 
that many of the detected communities correspond to groups of 
scientists working in common areas of science [22]. Such 
anecdotal evaluation procedures require extensive manual 
effort, are non-comprehensive and limited to small networks. 

A possible solution would be to find a reliable definition 
of explicitly labeled gold-standard ground-truth communi- 
ties. Using such ground-truth communities would allow for 
quantitative and large-scale evaluation and comparison of 
network community detection methods. Such ability would 
enable the field to move beyond the current standard of 
anecdotal evaluation of communities to a comprehensive 
evaluation of community detection methods based on how well 
they recover the ground-truth. 

The contributions of our work are threefold. First, we 
describe a set of 230 large social and information networks 
where we define ground-truth communities in a reliable 
way. Second, based on the ground-truth we quantitatively 
evaluate 13 commonly used structural definitions of network 
communities and examine their robustness and sensitivity 
to noise. Third, we extend the local spectral clustering 
algorithm into a parameter-free community detection method 
that scales to networks of hundreds of millions of nodes. 

Present work: Ground-truth communities. Next we de- 
scribe the proposed definition of ground-truth communities 
and argue why it corresponds to "real" communities. 

Generally, after communities are identified based on the 
structure of a given network, the essential next step is to 
interpret them by identifying a common external property or 
function that the members of a given community share and 
around which the community organizes Q. For example, 
given a protein-protein interaction network of a cell we 
identify communities based on the structure of the network 
and then find that these communities correspond to real 
functional units of a cell. Thus, the goal of community 
detection is to identify sets of nodes with a common 
(often external/latent) function based only on the connectivity 
structure of the network. A common function can be a 
common role, affiliation, or attribute [12]. In our protein 
interaction network example above, such common function 
of nodes would be 'belonging to the same functional unit'. 
Community detection methods identify communities based 
on structure while the extracted communities are evaluated 
based on their function. So we distinguish between structural 
and functional definitions of communities. We use common 
function of nodes to define ground-truth communities. 

Present work: Networks with ground-truth. We gathered 
230 networks from a number of different domains and 
research areas where nodes explicitly state their ground-truth 
community memberships. Our collection consists of social, 
collaboration and information networks for each of which 
we find a robust functional definition of ground-truth. 

For example, in online social networks (such as Orkut, 
LiveJournal, Friendster and 225 different Ning networks) we 
consider explicitly defined interest-based groups (e.g., fans 
of Lady Gaga, students of the same school) as ground-truth 
communities. Nodes explicitly join such groups that organize 
around specific topics, interests, and affiliations Q, [12]. 
Next, we also consider the Amazon product co-purchasing 
network where we define ground-truth using hierarchically 
nested product categories. Here all members (i.e., products) 
of the same ground-truth community share a common func- 
tion or purpose. Last, in the scientific collaboration network 
of DBLP we use publication venues as proxies for ground- 
truth research communities. Our reasoning here is that in 
scientific collaboration networks, real communities would 
correspond to areas of science. Thus, we use journals and 
conferences as proxies for scientific communities. 

Present work: Methodology and findings. The ground- 
truth allows us to examine how well various structural defini- 
tions of network communities correspond to real functional 
groups (i.e., ground-truth communities). A good structural 
definition of a community would be such that it would detect 
connectivity patterns that correspond to real groups (i.e., 
the ground-truth). This means that we can evaluate differ- 
ent structural definitions based on their ability to identify 
connectivity structure of ground-truth communities. 

We study 13 commonly used structural definitions of com- 
munities and examine their quality, sensitivity and robust- 
ness. Each such definition corresponds to a scoring function 
that scores a set of nodes based on their connectivity. A 
high score means that the connectivity of a set of nodes closely 
resembles that of a community. By comparing correlations of 
scores that different structural definitions assign to ground-truth 
communities, we find that the 13 definitions naturally 
group into four distinct classes. These classes correspond 
to definitions that consider: (1) only internal community 
connectivity; (2) only external connectivity of the nodes 
to the rest of the network; (3) both internal and external 
community connectivity; and (4) network modularity. 

We then consider an axiomatic approach and define four 
intuitive properties that communities would ideally have. 
Intuitively, a "good" community is cohesive, compact, and 
internally well connected while being also well separated 
from the rest of the network. This allows us to characterize 
which connectivity patterns a given structural definition 
detects and which ones it misses. We also investigate the 
robustness of community scoring functions based on four 
types of randomized perturbation strategies. Overall, evaluation 
shows that scoring functions that are based on triadic 
closure [29] and the conductance score [27] best capture the 
structure of ground-truth communities. 

Last, we also investigate the task of detecting communities 
from a single seed node. The task is to discover all members 
of a community from a single seed member node. We extend 
the local spectral clustering algorithm into a parameter-free 
community detection method that scales to networks of 
hundreds of millions of nodes. Our method recovers ground-truth 
communities with 30% relative improvement in the 
F1-score over the current local graph partitioning methods. 

To the best of our knowledge our work is the first to 
use social and information networks with explicit commu- 
nity memberships to define an evaluation methodology for 
comparing network community detection methods based on 
their accuracy on real data. We believe that the present work 
will bring more rigor to the standard for the evaluation 
of community detection methods. All our datasets can be 
downloaded at http://snap.stanford.edu 

II. Community scoring functions and data sets 

We start by describing the network datasets and our 
proposed functional definitions of ground-truth communi- 
ties. Then we continue with outlining 13 commonly used 
structural definitions of network communities. 

Networks with ground-truth communities. Overall we 
consider 230 large social, collaboration and information 
networks, where for each network we have a graph and a set 
of functionally defined ground-truth communities. Members 
of these ground-truth communities share a common function, 
property or purpose. Networks that we study come from a 
wide range of domains and sizes. Table I lists the networks. 

Dataset            N         E           C           S        A
LiveJournal        4.0M      34.9M       311,782     40.06    3.09
Friendster         117.7M    2,586.1M    1,449,666   26.72    0.32
Orkut              3.0M      117.2M      8,455,253   34.86    95.9
Ning (225 nets)    7.0M      35.5M       137,177     46.89    0.92
Amazon             0.33M     0.92M       49,732      99.86    14.83
DBLP               0.42M     1.34M       2,547       429.79   2.56

Table I 

230 SOCIAL, COLLABORATION AND INFORMATION NETWORKS WITH 
EXPLICIT GROUND-TRUTH COMMUNITIES. N: NUMBER OF NODES, E: 
NUMBER OF EDGES, C: NUMBER OF COMMUNITIES, S: AVERAGE 
COMMUNITY SIZE, A: COMMUNITY MEMBERSHIPS PER NODE. NING 
STATISTICS ARE AGGREGATED OVER 225 DIFFERENT SUBNETWORKS. 

First, we consider online social networks (the LiveJournal 
blogging community [4], the Friendster online network [20], 
and the Orkut social network [20]) where users create 
explicit functional groups that others then join and 
share content. These groups are created based on specific 
topics, interests, hobbies and geographical regions. For example, 
LiveJournal categorizes groups into the following 
types: culture, entertainment, expression, fandom, gaming, 
life/style, life/support, sports, student life and technology. 
Similarly, in other social networks considered in this study 
users define topical communities that others then join. We 
consider each such explicit interest-based group as a ground-truth 
community. Similarly, we have a set of 225 different 
online social networks [13] that are all hosted by the Ning 
platform. It is important to note that each Ning network is 
a separate social network: an independent website with 
a separate user community. For example, the NBA team 
Dallas Mavericks and the diabetes patient network TuDiabetes 
both use Ning to host their separate online social networks. 
After joining a specific network, users then create and join 
groups. For example, in the TuDiabetes Ning network, groups 
form around specific types of diabetes, different age groups, 
and similar. Note that these are exactly the properties around 
which we expect communities to form in a network of 
diabetes patients. Again, we use such explicitly defined 
functional groups as ground-truth communities. 

The second type of network we consider is the Amazon 
product co-purchasing network [16]. The nodes of the 
network represent products and edges link commonly co- 
purchased products. Each product (i.e., node) belongs to 
one or more hierarchically organized product categories and 
products from the same category define a group which 
we view as a ground-truth community. Note that here the 
definition of ground-truth is somewhat different. In this case, 
nodes that belong to a common ground-truth community 
share a common function or purpose. 

Finally, we also consider the DBLP scientific collaboration 
network [4] where nodes represent authors and edges 
connect authors that have co-authored a paper. To define 
ground-truth in this setting we reason as follows. Commu- 
nities in a scientific domain correspond to people working 
in common areas and subareas of science. However, note 
that publication venues serve as good proxies for scientific 
areas: People publishing in the same conference form a 
scientific community. Thus we use publication venues (i.e., 
conferences) as ground-truth communities which serve as 
proxies for highly overlapping scientific communities around 
which the collaboration network then organizes. 

All our networks and the corresponding ground-truths 
are complete and publicly available at 
http://snap.stanford.edu/data. The results we present 
here are consistent and robust across a wide range of 
networks and across an even wider range of groups. This 
gives further evidence that our approach is general and 
well-founded. Our work is consistent with the premise that 
is implicit in all community detection works: members 
of structural communities share some functional role or 
property that serves as an organizing principle of the 
network. Here we use functionally defined groups as 
labeled ground-truth communities. 

Note that our work is fundamentally different from Ahn 
et al. HI, who evaluated communities with attribute-based 
node-node similarity of the members. This approach, for example, 
folds all social dimensions (family, school, interests) 
around which separate communities form into one similarity 
metric [19]. In contrast, we do not use node similarity to 
define communities. Rather, we harness explicitly labeled 
functional groups as labels of ground-truth communities. 

Data preprocessing. To represent all networks in a con- 
sistent way we consider each network as an unweighted 
undirected static graph. Because members of the group 
may be disconnected in the network, we consider each 
connected component of the group as a separate ground-truth 
community. However, we allow ground-truth communities to 
be nested and to overlap. 
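
The preprocessing step above can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function name `split_into_components`, the adjacency-dictionary representation and the toy graph are assumptions made for the example. It splits one declared group into the connected components of its induced subgraph, each of which would be treated as a separate ground-truth community.

```python
from collections import deque


def split_into_components(adj, group):
    """Split a node group into connected components of its induced subgraph.

    adj maps each node to the set of its neighbours (undirected graph).
    Only edges whose two endpoints both lie in the group are followed.
    """
    members = set(group)
    seen = set()
    components = []
    for start in sorted(members):
        if start in seen:
            continue
        # Breadth-first search restricted to group members.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v in members and v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components


# Hypothetical toy graph: path 1-2-3, edge 4-5, isolated node 6.
adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}
# Node 3 is not a group member, so the group falls apart into three pieces.
parts = split_into_components(adj, [1, 2, 4, 5, 6])
```

Because each component is handled independently, nested and overlapping groups need no special treatment in this sketch: every declared group is simply processed on its own.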

Community scoring functions. We now proceed to discuss 
various scoring functions that characterize how community-like 
the connectivity structure of a given set of nodes is. 
The idea is that given a community scoring function, one 
can then find sets of nodes with high score and consider 
these sets as communities. All scoring functions build on 
the intuition that communities are sets of nodes with many 
connections between the members and few connections from 
the members to the rest of the network. There are many 
possible ways to mathematically formalize this intuition. We 
gather 13 commonly used scoring functions, or equivalently, 
13 structural definitions of network communities. Some 
scoring functions are well known in the literature, while 
others are proposed here for the first time. 

Figure 1. Clusters based on correlations of community scoring functions. 

Given a set of nodes S, we consider a function f(S) 
that characterizes how community-like the connectivity 
of nodes in S is. Let G(V, E) be an undirected graph with 
$n = |V|$ nodes and $m = |E|$ edges. Let S be the set of 
nodes, where $n_S = |S|$ is the number of nodes in S; 
$m_S = |\{(u, v) \in E : u \in S, v \in S\}|$ is the number of edges in S; 
$c_S = |\{(u, v) \in E : u \in S, v \notin S\}|$ is the number of edges on 
the boundary of S; and $d(u)$ is the 
degree of node u. We consider 13 scoring functions f(S) 
that capture the notion of quality of a network community 
S. The experiments we will present later reveal that scoring 
functions naturally group into the following four classes: 
(A) Scoring functions based on internal connectivity: 

• Internal density: $f(S) = \frac{m_S}{n_S(n_S-1)/2}$ is the internal 
edge density of the node set S [24]. 

• Edges inside: $f(S) = m_S$ is the number of edges 
between the members of S [24]. 

• Average degree: $f(S) = \frac{2 m_S}{n_S}$ is the average internal 
degree of the members of S [24]. 

• Fraction over median degree (FOMD): 
$f(S) = \frac{|\{u : u \in S, |\{(u, v) : v \in S\}| > d_m\}|}{n_S}$ is the fraction of 
nodes of S that have internal degree higher than $d_m$, 
where $d_m$ is the median value of d(u) in V. 

• Triangle Participation Ratio (TPR): 
$f(S) = \frac{|\{u : u \in S, \{(v, w) : v, w \in S, (u, v) \in E, (u, w) \in E, (v, w) \in E\} \neq \emptyset\}|}{n_S}$ 
is the fraction of nodes in S that belong to a triad. 

(B) Scoring functions based on external connectivity: 

• Expansion measures the number of edges per node that 
point outside the cluster: $f(S) = \frac{c_S}{n_S}$ [24]. 

• Cut Ratio is the fraction of existing edges (out 
of all possible edges) leaving the cluster: 
$f(S) = \frac{c_S}{n_S (n - n_S)}$. 

(C) Scoring functions that combine internal and external 
connectivity: 

• Conductance: $f(S) = \frac{c_S}{2 m_S + c_S}$ measures the fraction 
of total edge volume that points outside the cluster [27]. 

• Normalized Cut: $f(S) = \frac{c_S}{2 m_S + c_S} + \frac{c_S}{2(m - m_S) + c_S}$ [27]. 

• Maximum-ODF (Out Degree Fraction): 
$f(S) = \max_{u \in S} \frac{|\{(u, v) \in E : v \notin S\}|}{d(u)}$ is the maximum fraction 
of edges of a node in S that point outside S [8]. 

• Average-ODF: $f(S) = \frac{1}{n_S} \sum_{u \in S} \frac{|\{(u, v) \in E : v \notin S\}|}{d(u)}$ is 
the average fraction of edges of nodes in S that point 
out of S H. 

• Flake-ODF: $f(S) = \frac{|\{u : u \in S, |\{(u, v) \in E : v \in S\}| < d(u)/2\}|}{n_S}$ 
is the fraction of nodes in S that have fewer edges 
pointing inside than to the outside of the cluster [8]. 

(D) Scoring function based on a network model: 

• Modularity: $f(S) = \frac{1}{4}(m_S - E(m_S))$ is the difference 
between $m_S$, the number of edges between nodes in S, 
and $E(m_S)$, the expected number of such edges in a 
random graph with identical degree sequence lETI. 
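
To make the definitions above concrete, the following minimal sketch computes four of the scoring functions (one or two from each of the first three classes) on a toy graph. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the adjacency-dictionary representation, the convention of returning 0 when a denominator is zero, and the requirement that a triangle lies entirely inside S for TPR are all choices made here.

```python
def community_counts(adj, S):
    """Return (n_S, m_S, c_S) for node set S.

    adj maps every node of an undirected, loop-free graph to its neighbour set.
    """
    S = set(S)
    # Each internal edge is seen from both endpoints, hence the division by 2.
    internal_endpoints = sum(1 for u in S for v in adj[u] if v in S)
    boundary = sum(1 for u in S for v in adj[u] if v not in S)
    return len(S), internal_endpoints // 2, boundary


def conductance(adj, S):
    """c_S / (2 m_S + c_S); 0.0 if S has no incident edges (assumed convention)."""
    n_S, m_S, c_S = community_counts(adj, S)
    volume = 2 * m_S + c_S
    return c_S / volume if volume else 0.0


def cut_ratio(adj, S):
    """c_S / (n_S (n - n_S)); 0.0 if S is empty or covers the whole graph."""
    n_S, m_S, c_S = community_counts(adj, S)
    outside = len(adj) - n_S
    return c_S / (n_S * outside) if n_S and outside else 0.0


def internal_density(adj, S):
    """m_S / (n_S (n_S - 1) / 2); 0.0 for sets with fewer than two nodes."""
    n_S, m_S, c_S = community_counts(adj, S)
    possible = n_S * (n_S - 1) / 2
    return m_S / possible if possible else 0.0


def triangle_participation_ratio(adj, S):
    """Fraction of nodes in S that close a triangle with two other nodes of S."""
    S = set(S)
    if not S:
        return 0.0
    count = 0
    for u in S:
        inside = [v for v in adj[u] if v in S]
        if any(w in adj[v] for i, v in enumerate(inside) for w in inside[i + 1:]):
            count += 1
    return count / len(S)


# Hypothetical toy graph: two triangles {0,1,2} and {3,4,5} joined by edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
S = {0, 1, 2}
scores = {
    "conductance": conductance(adj, S),
    "cut_ratio": cut_ratio(adj, S),
    "internal_density": internal_density(adj, S),
    "tpr": triangle_participation_ratio(adj, S),
}
```

For the triangle S = {0, 1, 2}, $n_S = 3$, $m_S = 3$ and $c_S = 1$, so conductance is 1/7, cut ratio is 1/9, and both internal density and TPR equal 1. Note that low values are better for the cut-based scores while high values are better for the internal ones.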

Experimental result: Four classes of scoring functions. 

Next we examine the relationship among the 13 community scoring 
functions we introduced. For each of the 10 million ground-truth 
communities in our networks, we compute a score 
using each of the 13 scoring functions. We then create 
a correlation matrix of scoring functions and threshold it. 
Fig. 1 shows connections between scoring functions with 
correlation > 0.6 (on the LiveJournal network). We ob- 
serve that scores naturally group into four clusters. This 
means that scoring functions of the same cluster return 
heavily correlated values and quantify the same aspect of 
connectivity structure. Overall, none of the scoring func- 
tions are negatively correlated, which means that none of 
them systematically disagree. Interestingly, Modularity is not 
correlated with any other scoring function (Avg. degree is 
the most correlated at 0.05 correlation). We observe similar 
results in all other data sets. 
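
The grouping procedure described above (correlate the scores, threshold the correlation matrix, read off the clusters) can be sketched as below. The text does not say which correlation coefficient was used, so Pearson correlation is an assumption here, as are the function names, the union-find grouping into connected components, and the toy score vectors, which are made up for illustration and are not results from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_groups(scores, threshold=0.6):
    """Group scoring functions linked by a correlation above the threshold.

    scores maps a function name to its list of values, one per community.
    Groups are the connected components of the thresholded correlation graph.
    """
    names = sorted(scores)
    parent = {name: name for name in names}

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(scores[a], scores[b]) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Made-up score vectors for four hypothetical scoring functions A-D.
toy_scores = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8.5],
    "C": [4, 1, 3, 2],
    "D": [8, 2, 6, 5],
}
groups = correlated_groups(toy_scores)
```

On this toy input A and B are almost perfectly correlated, as are C and D, while all cross pairs are negatively correlated, so two groups emerge.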

The experiment demonstrates that even though many 
different structural definitions of communities have been 
proposed, these definitions are heavily correlated. Essentially 
there are only 4 different structural notions of network 
communities as revealed by Fig. 1. For brevity, in the rest 
of the paper we present results for 6 representative scoring 
functions (denoted as blue nodes in Fig. 1: 4 from the two 
large clusters and 2 from the two small clusters). 

We also note that here we computed the values of the 
13 scores on ground-truth communities. In reality the aim 
of community detection is to find sets of nodes that maxi- 
mize a given scoring function. Exact maximization of these 
functions is typically NP-hard and leads to its own set of 
interesting problems. (Refer to ifPTl for discussion.) 

III. Evaluation of community scoring functions 

The second main purpose of the paper is to develop an 
evaluation methodology for network community detection. 
Based on ground-truth communities we now aim to compare 
and evaluate different community scoring functions. 

Community goodness metrics. Our goal is to rank different 
structural definitions of a network community (i.e., community 
scoring functions) by their ability to detect ground-truth 
communities. We adopt the following axiomatic approach. 
First, we define four community "goodness" metrics that 
formalize the intuition that "good" communities are both 
compact and well connected internally while being relatively 
well-separated from the rest of the network. 

The difference between community scoring functions 
from Section II and the goodness metrics introduced above 
is that a community scoring function quantifies how 
community-like a set is, while a goodness metric in an 
axiomatic way quantifies a desirable property of a community. 
A set with high goodness metric does not necessarily 
correspond to a community, but a set with high community 
score should have a high value on one or more goodness 
metrics. In other words, the goodness metrics shed light on 
various (in many cases mutually exclusive) aspects of the 
network community structure. 

Using the notation from Section II, we define four goodness 
metrics g(S) for a node set S: 

• Separability captures the intuition that good communities 
are well-separated from the rest of the network lETl, 
[9], meaning that they have relatively few edges pointing 
from set S to the rest of the network. Separability 
measures the ratio between the internal and the external 
number of edges of S: $g(S) = \frac{m_S}{c_S}$. 

• Density builds on the intuition that good communities are 
well connected [9]. It measures the fraction of the edges 
(out of all possible edges) that appear between the 
nodes in S: $g(S) = \frac{m_S}{n_S(n_S-1)/2}$. 

• Cohesiveness characterizes the internal structure of the 
community. Intuitively, a good community should be 
internally well and evenly connected, i.e., it should 
be relatively hard to split a community into two 
subcommunities. We characterize this by the conductance 
of the internal cut. Formally, $g(S) = \max_{S' \subset S} \phi(S')$ 
where $\phi(S')$ is the conductance of S' measured in 
the subgraph induced by S. Intuitively, conductance 
measures the ratio of the edges in S' that point outside 
the set and the edges inside the set S'. A good community 
should have high cohesiveness (high internal 
conductance) as it should require deleting many edges 
before the community would be internally split into 
disconnected components ifTTll. 

• Clustering coefficient is based on the premise that 
network communities are manifestations of locally 
inhomogeneous distributions of edges, because pairs of 
nodes with common neighbors are more likely to be 
connected with each other. 
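
As a rough illustration of three of these goodness metrics, the sketch below computes separability, density and a clustering coefficient for a node set. Several points are assumptions rather than statements from the text: the function names, returning infinity for separability when S has no boundary edges, and in particular the clustering coefficient, whose exact formula is not given above and is taken here to be the average local clustering coefficient in the subgraph induced by S. Cohesiveness is omitted because it requires searching over subsets S'.

```python
def separability(adj, S):
    """m_S / c_S; infinity when S has no boundary edges (assumed convention)."""
    S = set(S)
    internal = sum(1 for u in S for v in adj[u] if v in S) // 2
    boundary = sum(1 for u in S for v in adj[u] if v not in S)
    return internal / boundary if boundary else float("inf")


def density(adj, S):
    """m_S / (n_S (n_S - 1) / 2); 0.0 for sets with fewer than two nodes."""
    S = set(S)
    internal = sum(1 for u in S for v in adj[u] if v in S) // 2
    possible = len(S) * (len(S) - 1) / 2
    return internal / possible if possible else 0.0


def clustering_coefficient(adj, S):
    """Average local clustering coefficient in the subgraph induced by S.

    Assumed definition; nodes with fewer than two neighbours inside S count as 0.
    """
    S = set(S)
    if not S:
        return 0.0
    total = 0.0
    for u in S:
        inside = [v for v in adj[u] if v in S]
        k = len(inside)
        if k < 2:
            continue
        links = sum(
            1 for i, v in enumerate(inside) for w in inside[i + 1:] if w in adj[v]
        )
        total += links / (k * (k - 1) / 2)
    return total / len(S)


# Hypothetical toy graph: two triangles {0,1,2} and {3,4,5} joined by edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
triangle = {0, 1, 2}
looser = {2, 3, 4, 5}
```

The triangle scores 3.0 on separability and 1.0 on both density and clustering, whereas the looser set {2, 3, 4, 5} scores lower on all three, which matches the intuition that it is a less well-formed community.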



Experimental setup. We are interested in quantifying how 
"good" the communities chosen by a particular scoring 
function f(S) are by evaluating their goodness metric. We 
formulate our experiments as follows: For each of the 230 
networks, we have a set of ground-truth communities $S_i$. 
For each community scoring function f(S), we rank the 
ground-truth communities by decreasing score $f(S_i)$. 
We measure the cumulative running average value of the 
goodness metric g(S) of the top-k ground-truth communities 
(under the ordering induced by $f(S_i)$). 

The intuition for the experiments is the following. A 
perfect community scoring function would rank the com- 
munities in the decreasing order of the goodness metric 
and thus the cumulative running average of the goodness 
metric would decrease monotonically with k. In contrast, if a 
hypothetical community scoring function ranked the communities 
randomly, then the cumulative running average 
would be a constant function of k. 
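
The evaluation curve described above can be computed in a few lines. The sketch below is illustrative only; the function name and the toy numbers are assumptions. It sorts communities by decreasing score f and returns, for every k, the average goodness g of the top-k communities.

```python
from itertools import accumulate


def cumulative_average_by_score(f_scores, g_values):
    """Running average of g over communities ranked by decreasing f.

    f_scores[i] and g_values[i] belong to the same community i.
    Entry k-1 of the result is the mean goodness of the top-k communities.
    """
    order = sorted(range(len(f_scores)), key=lambda i: f_scores[i], reverse=True)
    ranked = [g_values[i] for i in order]
    return [total / k for k, total in enumerate(accumulate(ranked), start=1)]


# Made-up goodness values for four hypothetical communities.
g = [8.0, 4.0, 1.0, 6.0]

# A scoring function that orders communities exactly by goodness.
perfect_curve = cumulative_average_by_score(g, g)
# A scoring function that orders them in the reverse order.
inverse_curve = cumulative_average_by_score([1, 3, 4, 2], g)
```

With the perfect ordering the curve decreases monotonically (8, 7, 6, 4.75), while the inverted ordering produces an increasing curve that ends at the same overall mean of 4.75.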

Experimental results. We found qualitatively similar results 
on all our datasets. Here we only present results for the 
LiveJournal network. Results are representative for all other 
networks. We point the reader to the extended version of the 
paper OD for a complete set of results. 

Figure 2. Cumulative average of goodness metrics for LiveJournal 
communities ranked by each of the six representative scoring functions. 

Figure 2(a) shows the results by plotting the cumulative 
running average of separability for LiveJournal ground-truth 
communities ranked by each of the six community scoring 
functions. Curve "U" presents the upper bound, i.e., it plots 
the cumulative running average of separability when ground-truth 
communities are ordered by decreasing separability. We 
observe that Conductance (C) and Cut Ratio (CR) give near 
optimal performance, i.e., they nearly perfectly order the 
ground-truth communities by separability. On the other hand, 
we observe that Triad Participation Ratio (T) and Modularity 
(M) score ground-truth communities in the inverse order of 
separability (especially for k < 100), which means that they 
both prefer densely linked sets of nodes. 

Similarly, Figures 2(b), (c), and (d) show the cumulative 
running average of community density, cohesiveness and 
clustering coefficient. We observe that all scoring functions 
(except Modularity) rank denser, more cohesive and more 
clustered ground-truth communities higher. For the density 
metric, the Fraction over median degree (D) score performs 
best for high values of k followed by Conductance (C) 
and Flake-ODF (F). In terms of cohesiveness and clustering 
coefficient, the Triad Participation Ratio (T) score gives 
by far the best results. In all cases the only exception 
is Modularity, which ranks the communities in nearly 
reverse order of the goodness metric (the cumulative running 
average increases as a function of k). We note that these are 
all well-known issues of Modularity [10] but they get further 
attenuated when tested on ground-truth communities. 

Scoring function    Separability   Density   Cohesiveness   Clustering
Conductance (C)     1.0            3.5       3.4            3.1
Flake-ODF (F)       3.9            3.6       3.5            4.3
FOMD (D)            4.9            3.0       2.9            2.9
TPR (T)             4.5            2.3       2.1            1.2
Modularity (M)      4.0            5.5       5.7            3.9
CutRatio (CR)       2.6            3.1       3.2            5.5

Table II 

Average scoring function rank for each goodness metric. 

The curves in Figure 2 illustrate the ability of the scoring 
functions to rank communities. To quantify this we perform 
the following experiment. For a given goodness metric g and 
for each scoring function f, we measure the rank of each 
scoring function in comparison to other scoring functions at 
every value of k. For example, in Figure 2(a) the rank at 
k = 100 of Conductance is 1, Cut ratio 2, Flake-ODF 3, 
FOMD 4, Modularity 5, and TPR 6. For every k, we rank 
the scores and compute the average rank over all values of 
k, which quantifies the ability of the scoring function to 
identify communities with high goodness metric. 
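
The average-rank computation can be sketched as follows, given one cumulative-average curve per scoring function. The function name, the tie-breaking rule (ties keep the input order of the names) and the toy curves are assumptions made for this example; the numbers are not taken from the paper.

```python
def average_ranks(curves):
    """Average rank of each scoring function over all values of k.

    curves maps a scoring function name to its cumulative-average curve;
    all curves must have the same length. At each k the function with the
    highest running average gets rank 1. Ties keep the input order of names.
    """
    names = list(curves)
    length = len(curves[names[0]])
    totals = {name: 0 for name in names}
    for k in range(length):
        ordered = sorted(names, key=lambda name: curves[name][k], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return {name: totals[name] / length for name in names}


# Made-up curves for three hypothetical scoring functions over k = 1, 2, 3.
toy_curves = {
    "C": [9.0, 8.0, 7.0],
    "T": [5.0, 9.0, 6.0],
    "M": [1.0, 2.0, 3.0],
}
ranks = average_ranks(toy_curves)
```

Here C is ranked first at two of the three values of k and second once, giving an average rank of 4/3, while M is last everywhere and receives an average rank of 3.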

Table II shows the average rank for each score and each 
goodness metric. An average rank of 1 means that a particular 
score always outperforms other scores, while a rank of 6 
means that the score gives the worst ranking out of all 6 scores. 
We observe that Conductance (C) performs best in terms 
of Separability but relatively poorly on the other three metrics. 
For Density, Cohesiveness and Clustering coefficient, Triad 
Participation Ratio (T) is the best. Perhaps not surprisingly, 
Triad Participation Ratio scores badly on Separability of 
ground-truth communities. Thus, Conductance is able to 
identify well-separated communities, but performs poorly 
in identifying dense and cohesive sets of nodes with high 
clustering coefficient. On the other hand, Triad Participation 
Ratio gives the worst performance in terms of Separability 
but scores the best for the other three metrics. 

We conclude that depending on the network different 
definitions of network communities might be appropriate. 
When the network contains well-separated non-overlapping 
communities, Conductance is the best scoring function. 
When the network contains dense heavily overlapping com- 
munities, then the Triad Participation Ratio defines the 
most appropriate notion of a community. Further research 
is needed to identify most appropriate structural definitions 
of communities for various types of networks and types of 
functional communities. E.g., in social networks we have 
both identity-based as well as bond-based communities [25] 
and they may in fact have different structural signatures. 

Lastly, in Figure 2 we also observe that the average 
goodness metric of the top k communities remains flat but 
then quickly degrades. We observe the same pattern in all 
our data sets. Thus, for the remainder of the paper we focus 
our attention on the top 5,000 communities of each 
network based on the average rank over the 6 scores. 

IV. Robustness of community scoring functions 

In this section, we evaluate community scoring functions 
using a set of perturbation strategies. We develop a set of 
strategies to generate randomized perturbations of ground-truth 
communities, which allows us to investigate robustness 
and sensitivity of community scoring functions. Intuitively, 
a good community scoring function should be such that 
it is stable under small perturbations of the ground-truth 
community but degrades quickly with larger perturbations. 

Our reasoning is as follows. We desire a community 
scoring function that scores well when evaluated on a 
ground-truth community but scores low when evaluated on 
a perturbed community. In other words, an ideal commu- 
nity scoring function should give a maximal value when 
evaluated on the ground-truth community. If we consider a 
slightly perturbed ground-truth community (i.e., a node set 
that differs very slightly from the ground-truth community), 
we would want the score to be nearly as good as the 
score of the original ground-truth community. This would 
mean that the scoring function is robust to noise. However, 
if the ground-truth community is perturbed so much that 
it resembles a random set of nodes, then a good scoring 
function should give it a low score. 

Community perturbation strategies. We proceed by defining
a set of community perturbation strategies. To vary
the amount of perturbation, each perturbation strategy has
a single parameter p that controls the intensity of the
perturbation. Given p and a ground-truth community defined
by its members S, the community perturbation starts with S
and then modifies it (i.e., changes its members) by executing
the perturbation strategy p|S| times. We define the following
perturbation strategies:

• NodeSwap perturbation is based on the mechanism
where community memberships diffuse from the
original community through the network. We achieve
this by picking a random edge (u, v) where u ∈ S and
v ∉ S and then swapping the memberships (i.e., removing
u from S and adding v). Note that NodeSwap preserves
the size of S, but if v is not connected to the nodes in
S, then NodeSwap makes S disconnected.

• Random takes community members and replaces them
with random non-members. We pick a random node
u ∈ S and a random node v ∉ S and then swap the
memberships. Like NodeSwap, Random maintains the size
of S but may disconnect S. Generally, Random will
degrade the quality of S faster than NodeSwap, since
NodeSwap only affects the "fringe" of the community.

• Expand perturbation grows the membership set S by
expanding it at the boundary. We pick a random edge
(u, v) where u ∈ S and v ∉ S and add v to S.
Adding v to S will generally decrease the quality of
the community. Expand preserves the connectedness
of S but increases the size of S.

• Shrink removes members from the community boundary.
We pick a random edge (u, v) where u ∈ S and v ∉ S
and remove u from S. Shrink will decrease the
quality of S and result in a smaller community while
preserving connectedness.

For a given S, let h(S,p) denote a perturbed version of the 
community generated by the perturbation h of intensity p. 
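The four strategies can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the adjacency-dictionary graph representation, the function names `boundary_edges` and `perturb`, and the rounding of p|S| to the nearest integer are assumptions introduced here.

```python
import random

STRATEGIES = ("nodeswap", "random", "expand", "shrink")

def boundary_edges(adj, S):
    """All edges (u, v) with u inside S and v outside S."""
    return [(u, v) for u in S for v in adj[u] if v not in S]

def perturb(adj, S, p, strategy, rng=None):
    """Return h(S, p): a copy of S after round(p * |S|) perturbation steps.

    adj maps each node to the set of its neighbours (undirected graph).
    Nodes are assumed to be sortable so that sampling is reproducible.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    rng = rng or random.Random()
    S = set(S)
    for _ in range(round(p * len(S))):
        if strategy == "random":
            # Any member may be replaced by any non-member.
            outside = [v for v in adj if v not in S]
            if not S or not outside:
                break
            u, v = rng.choice(sorted(S)), rng.choice(outside)
        else:
            # The other strategies act on a random boundary edge (u, v).
            edges = boundary_edges(adj, S)
            if not edges:
                break
            u, v = rng.choice(edges)
        if strategy in ("nodeswap", "random"):
            S.remove(u)
            S.add(v)
        elif strategy == "expand":
            S.add(v)
        else:  # shrink
            S.remove(u)
    return S
```

Recomputing the boundary at every step keeps the sketch short; a large-scale implementation would presumably maintain it incrementally.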

We now quantify the difference of the score between the
unperturbed ground-truth community and its perturbation.
We use the Z-score, which measures in units of standard
deviation how much the scoring function changes as a
function of perturbation intensity p. For ground-truth
communities S_i, the Z-score Z(f, h, p) of community scoring
function f under perturbation strategy h with intensity p is:



$$Z(f, h, p) = \frac{E_i\left[f(S_i) - f(h(S_i, p))\right]}{\sqrt{\mathrm{Var}_i\left[f(h(S_i, p))\right]}},$$



where E_i[·] and Var_i[·] are the mean and the variance over
communities S_i, and f(h(S_i, p)) is the community score
of perturbed S_i under perturbation h with intensity p.
To measure f(h(S_i, p)), we run 20 trials of h(S_i, p) and
compute the average value of f. The Z-score is the difference
between the average community score of true communities
f(S_i) and the average community score of perturbed
communities f(h(S_i, p)), normalized by the standard
deviation of the community scores of perturbed communities.
Since the f(h(S_i, p)) are independent for each i,
E_i[f(h(S_i, p))] follows a Normal distribution by the Central
Limit Theorem. Thus, P(z < Z(f, h, p)) gives the probability
that E_i[f(h(S_i, p))] > E_i[f(S_i)], where z is a standard
normal random variable. We measure f so that lower values
mean better communities, i.e., we add a negative sign to TPR,
Modularity and FOMD. High Z-scores mean that E_i[f(S_i)]
is likely to be smaller than E_i[f(h(S_i, p))] and that S_i is
better than h(S_i, p) in terms of f.
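As a rough illustration, the Z-score can be computed as below. The sketch follows the formula exactly as printed (numerator f(S_i) minus f(h(S_i, p))), so the sign of the result depends on how the caller orients f. The function names and the use of the population variance are assumptions made here, since the text does not say which variance estimator is used.

```python
import math
from statistics import mean, pvariance

def mean_perturbed_score(f, h, S, p, trials=20):
    """Average f over repeated random perturbations h(S, p) of one community."""
    return mean(f(h(S, p)) for _ in range(trials))

def z_score(true_scores, perturbed_scores):
    """Z(f, h, p) = E_i[f(S_i) - f(h(S_i, p))] / sqrt(Var_i[f(h(S_i, p))]).

    true_scores[i] is f(S_i); perturbed_scores[i] is the trial-averaged
    f(h(S_i, p)) of the same community i.
    """
    diffs = [t - q for t, q in zip(true_scores, perturbed_scores)]
    variance = pvariance(perturbed_scores)  # assumption: population variance
    if variance == 0:
        raise ValueError("perturbed scores have zero variance")
    return mean(diffs) / math.sqrt(variance)
```

Here `h` stands for any callable that returns a perturbed copy of S, and `f` for any scoring function on a node set.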

Experimental results. For each of the 6 community scoring 
functions, we measure Z-score for perturbation intensity p 
ranging between 0.01 and 0.6. This means that we randomly 
swap between 1% and 60% of the community members and 
measure the Z-score for each scoring function. For small 
p, small Z-scores are desirable since they indicate that the 
scoring function is robust to noise. For high perturbation in- 
tensities p, high Z-scores are preferred because this suggests 
that the community scoring function is sensitive, i.e., as the 
community becomes more "random" we want the scoring 
function to significantly increase its value. 

Figure 3. Z-scores as a function of the perturbation intensity. Conductance
(C) and Triad Participation Ratio (T) best detect the perturbations of
LiveJournal ground-truth communities.

Scoring function    NodeSwap   Random   Expand   Shrink
Conductance (C)     1.06*      1.59     0.50     0.45*
Flake-ODF (F)       0.51       1.15     0.11     0.41
FOMD (D)            0.18       0.57     0.19     0.12
TPR (T)             0.37       1.85*    0.74*    0.21
Modularity (M)      0.23       0.14     0.03     0.15
CutRatio (CR)       0.53       0.83     0.13     0.43

Table III
Average absolute increment of the Z-score between small
and large community perturbations. Best-performing scores
in each column are marked with *.

Figure 3 shows the Z-scores of LiveJournal communities
as a function of perturbation intensity p. We plot the
Z-score for each of the 6 community scoring functions. As
expected, the Z-scores increase with p, which means that as
the community gets more perturbed, the value of the score
tends to decrease. However, the faster the increase, the more
sensitive and thus the better the score. For example, under
the NodeSwap perturbation Conductance (C) exhibits the
highest Z-score for p > 0.2, and it has the steepest curve.
Triad Participation Ratio (T) also exhibits desirable behavior.
On the other hand, the Modularity (M) score does not change
much as we perturb the ground-truth communities. This
means that Modularity is not good at distinguishing true
communities from randomized sets of nodes. We note very
similar results on all of the remaining datasets considered in
this study. Refer to the extended version for details OTI . 

Sensitivity of community scoring functions. We also
quantify the sensitivity of community scoring functions by
computing the increase of the Z-score between small
(p = 0.05) and large (p = 0.2) perturbations. As noted above,
we prefer a community scoring function whose Z-score
increases quickly as the perturbation intensity increases.
Table III displays the difference of the Z-score between a
large and a small perturbation: Z(f, h, 0.2) - Z(f, h, 0.05).
We compute the average increment across all 230
networks. A high value of the increment means that the score
is both robust and sensitive. The score is robust because even
at small perturbation (p = 0.05) it maintains a low Z-value,
while at large perturbation (p = 0.2) it has a high Z-value and
thus the overall Z-score increment is high.

Conductance (C) is the most robust score under NodeSwap
and Shrink. The Triad Participation Ratio (T) is the most
robust under Random and Expand; in both of these cases
Conductance follows it closely.
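The ranking stated above can be checked directly against the Table III values. In the small sketch below, the numbers are copied from Table III; the dictionary layout and the helper names `z_increment` and `most_sensitive` are illustrative assumptions.

```python
# Average Z-score increments Z(f, h, 0.2) - Z(f, h, 0.05), copied from Table III.
TABLE_III = {
    "Conductance": {"NodeSwap": 1.06, "Random": 1.59, "Expand": 0.50, "Shrink": 0.45},
    "Flake-ODF":   {"NodeSwap": 0.51, "Random": 1.15, "Expand": 0.11, "Shrink": 0.41},
    "FOMD":        {"NodeSwap": 0.18, "Random": 0.57, "Expand": 0.19, "Shrink": 0.12},
    "TPR":         {"NodeSwap": 0.37, "Random": 1.85, "Expand": 0.74, "Shrink": 0.21},
    "Modularity":  {"NodeSwap": 0.23, "Random": 0.14, "Expand": 0.03, "Shrink": 0.15},
    "CutRatio":    {"NodeSwap": 0.53, "Random": 0.83, "Expand": 0.13, "Shrink": 0.43},
}

def z_increment(z_small, z_large):
    """Sensitivity of one scoring function: Z at p = 0.2 minus Z at p = 0.05."""
    return z_large - z_small

def most_sensitive(table):
    """For each perturbation strategy, the scoring function with the largest increment."""
    strategies = next(iter(table.values())).keys()
    return {h: max(table, key=lambda f: table[f][h]) for h in strategies}

if __name__ == "__main__":
    print(most_sensitive(TABLE_III))
```

On these values the helper selects Conductance for NodeSwap and Shrink and TPR for Random and Expand, in agreement with the text.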



Algorithm 1 Community detection from a seed node
Require: Graph G(V, E), seed node s, scoring function f

(1) Compute random walk scores r_u from seed node s
using PageRank-Nibble [2].

(2) Order nodes u by decreasing value of r_u / d(u),
where d(u) is the degree of u.

(3) Compute the community scoring function on the first
k nodes, f_k = f(S_k) with S_k = {u_i | i ≤ k}, for every k.

(4) Detect local minima of f(S_k) and detect one or more
communities:

if we want to detect one community then
    Find the index k* at the first local minimum of f_k.
    return S = {u_i | i ≤ k*}
else
    Find the indices k*_j at every local minimum of f_k.
    return S_j = {u_i | i ≤ k*_j} for every j
end if



V. Discovering communities from a seed node 

Now we focus on the task of inferring communities given 
a single seed node. We consider two tasks that build on 
two different viewpoints. The first task is motivated by a 
community-centric view where we discover all members of 
community S given a single member s ∈ S. The second
task is motivated by a node-centric view where we want to 
discover all communities that a single node s belongs to. 
This means we discover both the number of communities s 
belongs to as well as the members of these communities. 

Proposed method. We extend the local spectral clustering
algorithm [28], [2] into a scalable parameter-free community
detection method. Our method has two benefits. First, the
method requires no input parameters and is able to
automatically detect the number of communities as well as the
members of those communities. Second, the computational
cost of our method is proportional to the size of the detected
community (not the size of the network). Thus, our method
is scalable to networks with hundreds of millions of nodes.

Our method (Algorithm 1) builds on the findings in
Sections III and IV. First, we aim to find sets of
well-connected nodes around node s. We achieve this by
defining a local partitioning method based on random walks
starting from a single seed node [2]. In particular, we use the
PageRank-Nibble random walk method that computes the
PageRank vector with error < ε in time O(1/ε) independent of
the network size [3]. The nodes with high PageRank scores
from s correspond to the well-connected nodes around
s. Moreover, the random walk is "truncated" as it sets
PageRank scores r_u to 0 for nodes u with r_u < ε, for some
small constant ε [2]. This way the computational cost is
proportional to the size of the detected community and not
the size of the network.
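The truncated random walk can be sketched with the standard "push" procedure for approximate personalized PageRank, which only ever touches nodes near the seed. This is a minimal sketch under stated assumptions: the teleport probability `alpha`, the tolerance `eps`, their default values, the function name, and the adjacency-dictionary graph are all illustrative choices that do not come from the text.

```python
from collections import deque

def approximate_ppr(adj, seed, alpha=0.15, eps=1e-4):
    """Approximate personalized PageRank scores around `seed`.

    adj maps each node to the set of its neighbours. A node u is processed
    only while its residual r[u] is at least eps * degree(u), so nodes far
    from the seed are never visited and keep an implicit score of 0.
    """
    p = {}                # approximate PageRank scores (the r_u of the text)
    r = {seed: 1.0}       # residual probability mass still to be distributed
    queue, queued = deque([seed]), {seed}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        du, ru = len(adj[u]), r.get(u, 0.0)
        if du == 0 or ru < eps * du:
            continue
        # Push: keep a fraction alpha, leave half of the rest at u (lazy
        # walk) and spread the other half evenly over the neighbours.
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = (1 - alpha) * ru / 2
        share = (1 - alpha) * ru / (2 * du)
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + share
            if v not in queued and r[v] >= eps * len(adj[v]):
                queue.append(v)
                queued.add(v)
        if r[u] >= eps * du and u not in queued:
            queue.append(u)
            queued.add(u)
    return p
```

Each push moves at least alpha * eps of probability mass into `p`, so the number of pushes is bounded regardless of the size of the graph.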

Figure 4. Two community scoring functions f (Conductance) and f'
(Triad Participation Ratio) evaluated on a set S_k of top k nodes with
highest random walk proximity score to seed node s. Local optima of
f(S_k) correspond to detected communities.

After PageRank-Nibble assigns the proximity scores
r_u, we sort the nodes in decreasing proximity r_u and
proceed to the second step of our algorithm, which extends
the approach of Spielman and Teng [28]. We evaluate the
community score on a set S_k of all the nodes up to the k-th
one (note that by construction S_{k-1} ⊂ S_k). This means that
for a chosen community scoring function f we compute f(S_k)
of the set S_k that is composed of the top k nodes with
the highest random walk score r_u. The local minima of the
function f(S_k) then correspond to extracted communities.
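A sweep over the ranked nodes might look as follows. The sketch assumes the usual definition of Conductance, c_S / (2 m_S + c_S), where c_S counts boundary edges and m_S internal edges, and it ranks nodes by r_u / d(u) as in step (2) of Algorithm 1. The function names and the graph representation are illustrative assumptions.

```python
def conductance(adj, S):
    """Assumed definition: c_S / (2 * m_S + c_S); lower values are better."""
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    twice_internal = sum(1 for u in S for v in adj[u] if v in S)  # equals 2 * m_S
    volume = twice_internal + cut
    return cut / volume if volume else 1.0

def sweep_curve(adj, scores, f=conductance):
    """Return the node ordering and the list [f(S_1), f(S_2), ...].

    scores maps nodes to random walk scores r_u; nodes are ranked by
    r_u / d(u), and S_k is the set of the first k nodes in that ranking.
    """
    order = sorted(scores, key=lambda u: scores[u] / max(len(adj[u]), 1), reverse=True)
    members, curve = set(), []
    for u in order:
        members.add(u)
        # Recomputing f from scratch keeps the sketch short; at scale the
        # cut and volume would presumably be updated incrementally.
        curve.append(f(adj, members))
    return order, curve
```

For two triangles joined by a single edge, with scores concentrated on one triangle, the curve reaches its first dip once that triangle is complete.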

We detect local minima of f(S_k) using the following
heuristic. For increasing k = 1, 2, ..., we measure f(S_k).
At some point k*, f(S_k) will stop decreasing and this k*
becomes our "candidate point" for a local minimum. If
f(S_k) keeps increasing after k* and eventually becomes
higher than α f(S_k*), we take k* as a valid local minimum.
However, if f(S_k) goes down again before it reaches
α f(S_k*), we discard the candidate k*. We experimented
with several values of α and found that α = 1.2 gives good
results across all the datasets.
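One possible reading of this heuristic is sketched below. It interprets "goes down again" as the curve falling below the candidate's own value, which is an assumption, as are the function name and the 0-based indexing (index i corresponds to the set S_{i+1}).

```python
def local_minima(curve, alpha=1.2):
    """Indices of accepted local minima of a sweep curve.

    A candidate is an index where the curve stops decreasing. It is accepted
    once a later value exceeds alpha times the candidate's value, and it is
    discarded if the curve first drops below the candidate's value.
    """
    minima, cand = [], None
    for k, cur in enumerate(curve):
        if cand is not None:
            if cur > alpha * curve[cand]:
                minima.append(cand)   # rose far enough: accept the candidate
                cand = None
            elif cur < curve[cand]:
                cand = None           # dipped below the candidate: discard it
        if cand is None and k + 1 < len(curve):
            stopped_decreasing = k == 0 or curve[k - 1] >= cur
            if stopped_decreasing and curve[k + 1] > cur:
                cand = k
    return minima
```

The first returned index gives the single detected community (the first i + 1 nodes of the ranking); the full list gives the nested communities.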

For example, Figure 4 plots f(S_k) for two community
scoring functions f (Conductance) and f' (Triad Participation
Ratio). We identify the local optima (denoted by stars and
squares) and use the nodes in the corresponding sets S_k as
the detected communities.

Note that our method can detect multiple communities
that the seed node belongs to by identifying different local
minima of f(S_k). However, we assume that the communities
are nested (smaller communities are contained in the larger
ones) even though the ground-truth communities may not
necessarily follow such a nested structure. Also, note that
our method is parameter-free. Our method differs from local
graph clustering approaches [2], [28] in two important
aspects. First, instead of sweeping only using Conductance, we
consider sweeping using other scoring functions. Second, we
find the local optima of the sweep curve instead of the global
optimum. This change gives a large improvement over the
conventional local spectral clustering approaches [2], [28].

[Table: Detecting communities from a seed node.]

Detecting a community from a single member. We first
consider the task where we aim to reconstruct a single
ground-truth community S based on one member node s.
For each community S, we pick a random member node s
as a seed node and compare the community we detect from
s with the ground-truth community S. Starting from node
s, we generate a sweep curve f(S_k). Let k* be the value
of k where f(S_k) achieves the first local minimum. We then
use the set S_k* as the detected community. Now, given the
ground-truth community S and the detected community S_k*,
we evaluate the precision, the recall and the F1-score. We
consider 6 community scoring functions f(·). We compare
the performance of our method to two standard community
detection methods: Local Spectral clustering (LC) [2], and
the 3-clique Clique Percolation Method (CPM) [23].
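The set-based precision, recall and F1-score used in this comparison can be written down directly. The function name and the convention of returning 0 for empty sets are assumptions of this sketch.

```python
def precision_recall_f1(detected, truth):
    """Compare a detected node set with a ground-truth node set.

    precision = |detected & truth| / |detected|
    recall    = |detected & truth| / |truth|
    F1        = harmonic mean of precision and recall
    """
    detected, truth = set(detected), set(truth)
    overlap = len(detected & truth)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(detected)
    recall = overlap / len(truth)
    return precision, recall, 2 * precision * recall / (precision + recall)
```

A detected community that is too large has high recall but low precision, and one that is too small has the opposite profile.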

Table IV shows the performance of the proposed method
for each scoring function and for the two baselines. The first
5 rows show the F1-score for each of the datasets, and
the last 3 rows show the average F1-score, precision and
recall over all the datasets. We observe that Conductance
(C) gives the best average F1-score, and outperforms
all other scores on LiveJournal (LJ), Orkut, Amazon, and
Ning. For Friendster (FS) and DBLP, the Triad Participation
Ratio (T) performs best. This agrees with our intuition that
for networks that have fewer community overlaps, like
LiveJournal, scoring functions that focus on good separability
perform well. In networks where nodes belong to multiple
communities (like DBLP, where authors publish at multiple
venues), the Triad Participation Ratio (T) performs best. We
also note that the average F1-score of Conductance is 0.46,
while the baselines CPM and LC achieve F1-scores of only
0.36 and 0.37, respectively. This is roughly a 10% absolute and
30% relative improvement over the state-of-the-art baselines.

Last, we observe that some methods detect larger
communities than necessary (higher recall, lower precision).
Modularity (M) most severely overestimates community size.
Conductance (C) and both baselines (LC and CPM) exhibit
similar behavior but to a lesser extent. On the contrary,
Flake-ODF (F), Fraction over median degree (D), Triad
Participation Ratio (T), and CutRatio (CR) tend to
underestimate the community size (higher precision than recall).

Detecting all communities that a seed node belongs to. We
also explore the second task where we want to detect all the
communities to which a given seed node s belongs. In this
task, we are given a node s that is a member of multiple
communities, but we do not know which and how many
communities s belongs to. We detect multiple communities
by detecting all the local minima (and not just the first one)
of the sweep curve. This way our method detects both the
number and the members of communities.

For each data set, we sample a node s, detect communities
S_j, and compare them to the ground-truth communities S_i
that node s belongs to. To measure correspondence between
the true and the detected communities, we match
ground-truth communities to detected communities by the
Hungarian matching method [15]. We then compute the average
F1-score over the matched pairs. We use Conductance as the
community scoring function and report results in Table V.
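The matching step can be illustrated on small inputs as follows. Note the substitution: this sketch finds the optimal one-to-one matching by exhaustive search rather than by the Hungarian algorithm, which gives the same matching on small inputs but does not scale. For realistic sizes one would use an assignment solver such as `scipy.optimize.linear_sum_assignment`. The function names are illustrative assumptions.

```python
from itertools import permutations

def f1(a, b):
    """F1-score between two node sets (symmetric in its arguments)."""
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(b)
    return 2 * precision * recall / (precision + recall)

def average_matched_f1(truth, detected):
    """Average F1 over the one-to-one matching that maximizes total F1.

    truth and detected are lists of node sets. Returns the average over
    matched pairs together with the pairs as (truth index, detected index).
    """
    if not truth or not detected:
        return 0.0, []
    swap = len(truth) > len(detected)
    small, large = (detected, truth) if swap else (truth, detected)
    best_total, best_perm = -1.0, ()
    # Exhaustive search over assignments; feasible only for a few communities.
    for perm in permutations(range(len(large)), len(small)):
        total = sum(f1(small[i], large[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    pairs = [(j, i) if swap else (i, j) for i, j in enumerate(best_perm)]
    return best_total / len(small), pairs
```

Unmatched communities on the larger side are ignored here, in line with averaging "over the matched pairs".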

Note that this task is harder than the previous one, as here
we aim to discover multiple communities simultaneously.
Whereas the previous task evaluated our method for each
ground-truth community, here we first sample node s and
then search for the communities S_i that s belongs to.
Therefore, larger ground-truth communities will be included
in S_i more often. Since larger ground-truth communities are
less well separated [18], this makes the task harder.

Table V reports the average F1-score as a function of
the number of communities g that the seed node s belongs
to. Given that this is a harder task, we observe lower
values of the F1-score. Intuitively we also expect that the
task becomes harder as s belongs to more communities.
In fact we observe that the performance degrades with
increasing g. Interestingly, in LiveJournal and Amazon it
appears to be easier to detect communities of nodes that
belong to 2 communities than to detect a community of a
node that belongs to only a single community. This is due
to the fact that single-community nodes reside on the border
of the community and consequently Conductance produces
communities that are too small [18].

VI. Conclusion 

The lack of reliable ground-truth gold-standard communi- 
ties has made network community detection a very challeng- 
ing task. In this paper, we studied a set of 230 different large 
social, collaboration and information networks in which we 
defined the notion of ground-truth communities by nodes 
explicitly stating their group memberships. 

We developed an evaluation methodology for comparing
network community detection algorithms based on their
accuracy on real data, compared different definitions
of network communities, and examined their robustness.
Our results demonstrate large differences in the behavior of
community scoring functions. Last, we also studied the
problem of community detection from a single seed node.
We examined a class of scalable parameter-free community
detection methods based on random walks and found that
our methods reliably detect ground-truth communities.

The availability of ground-truth communities allows for a 
range of interesting future directions. For example, further 
examining the connectivity structure of ground-truth
communities could lead to novel community detection
methods [30]. Overall, we believe that the present work will
bring more rigor to the evaluation of network community 
detection, and the datasets publicly released as a part of this 
work will benefit the research community. 